Dennis Chandler
14 June, 2018
R Enthusiast
St. Louis, MO
[1] "Violence is the last refuge of the incompetent"
[2] "The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom"
[3] "People who think they know everything are a great annoyance to those of us who do"
[4] "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I've found it!), but 'That's funny...'"
library(tidytext); library(tidyverse)
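# `text` is the character vector of four quotations printed above
text_df <- tibble(line = 1:4, text = text)  # tibble() replaces the deprecated data_frame()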
text_df <- unnest_tokens(text_df, word, text)
text_df
# A tibble: 64 x 2
line word
<int> <chr>
1 1 violence
2 1 is
3 1 the
4 1 last
5 1 refuge
6 1 of
7 1 the
8 1 incompetent
9 2 the
10 2 saddest
# ... with 54 more rows
text_df %>% count(word, sort = TRUE)
# A tibble: 51 x 2
word n
<chr> <int>
1 the 5
2 is 3
3 of 3
4 gathers 2
5 science 2
6 that 2
7 to 2
8 who 2
9 a 1
10 annoyance 1
# ... with 41 more rows
text_df <- tibble(line = 1:4, text = text)  # tibble() replaces the deprecated data_frame()
text_df <- unnest_tokens(text_df, ngram, text, token = "ngrams", n = 3)
text_df
# A tibble: 56 x 2
line ngram
<int> <chr>
1 1 violence is the
2 1 is the last
3 1 the last refuge
4 1 last refuge of
5 1 refuge of the
6 1 of the incompetent
7 2 the saddest aspect
8 2 saddest aspect of
9 2 aspect of life
10 2 of life right
# ... with 46 more rows
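The trigram rows above come from sliding a three-word window along each line. A minimal base R sketch of that window, using the already lower-cased words of the first quotation (the variable names are illustrative, not part of tidytext):

```r
# Lower-cased tokens of the first quotation.
words <- c("violence", "is", "the", "last", "refuge", "of", "the", "incompetent")
n <- 3

# Every valid start position i yields words i through i + n - 1.
starts <- seq_len(length(words) - n + 1)
trigrams <- vapply(
  starts,
  function(i) paste(words[i:(i + n - 1)], collapse = " "),
  character(1)
)
print(trigrams)
```

Eight words give six trigrams, which matches rows 1 to 6 of the output for line 1.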
Built-in stop_words
Use dplyr's anti_join() or other techniques to remove them
# A tibble: 3 x 2
lexicon n
<chr> <int>
1 onix 404
2 SMART 571
3 snowball 174
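The slide names anti_join() but shows no call. With tidytext loaded, the usual form would be anti_join(text_df, stop_words, by = "word"). A minimal base R sketch of the same idea follows, as one of the "other techniques"; the five-word toy_stop_words vector is a made-up stand-in for the real stop_words table.

```r
# Toy tokens in the one-word-per-row shape that unnest_tokens() produces.
tokens <- data.frame(
  line = rep(1L, 8),
  word = c("violence", "is", "the", "last", "refuge", "of", "the", "incompetent"),
  stringsAsFactors = FALSE
)

# Hypothetical mini stop list; the real stop_words table is far larger.
toy_stop_words <- c("is", "the", "of", "a", "to")

# Anti-join by hand: drop every row whose word appears in the stop list.
content_words <- tokens[!tokens$word %in% toy_stop_words, ]
print(content_words$word)
```

Only the content-bearing words of the quotation survive the filter.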
library(gutenbergr); library(stringr)
novel <- gutenberg_download(36)
chapter <- mutate(novel,
                  linenumber = row_number(),
                  chapter = cumsum(str_detect(text,
                                              regex("^chapter ", ignore_case = TRUE))))
tail(chapter)
# A tibble: 6 x 4
  gutenberg_id text                                     linenumber chapter
         <int> <chr>                                         <int>   <int>
1           36 the tumult of playing children, and to ~       6469      27
2           36 all bright and clear-cut, hard and sile~       6470      27
3           36 great day. . . .                               6471      27
4           36 ""                                             6472      27
5           36 And strangest of all is it to hold my w~       6473      27
6           36 that I have counted her, and that she h~       6474      27
token_novel <- unnest_tokens(chapter, word, text)
tail(token_novel, 15)
# A tibble: 15 x 4
   gutenberg_id linenumber chapter word
          <int>      <int>   <int> <chr>
 1           36       6473      27 think
 2           36       6474      27 that
 3           36       6474      27 i
 4           36       6474      27 have
 5           36       6474      27 counted
 6           36       6474      27 her
 7           36       6474      27 and
 8           36       6474      27 that
 9           36       6474      27 she
10           36       6474      27 has
11           36       6474      27 counted
12           36       6474      27 me
13           36       6474      27 among
14           36       6474      27 the
15           36       6474      27 dead
Built-in sentiment lexicons
Use dplyr's inner_join() or other techniques to attach them
# A tibble: 4 x 2
lexicon n
<chr> <int>
1 AFINN 2476
2 bing 6788
3 loughran 4149
4 nrc 13901
WOTW_sentiment <- token_novel %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
# fill is a fixed colour, so it is set in geom_col() rather than mapped inside aes()
ggplot(WOTW_sentiment, aes(index, sentiment)) +
  geom_col(fill = "red", show.legend = FALSE)
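The pipeline groups lines into 80-line bins with integer division and then subtracts negative counts from positive counts. A minimal base R sketch of that arithmetic, on made-up data that stands in for the output of the bing join:

```r
# Toy sentiment-tagged tokens (hypothetical line numbers and labels).
tagged <- data.frame(
  linenumber = c(3L, 40L, 79L, 80L, 95L, 161L, 170L, 200L),
  sentiment  = c("negative", "positive", "negative", "negative",
                 "negative", "positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Integer division puts lines 0-79 in bin 0, 80-159 in bin 1, and so on.
tagged$index <- tagged$linenumber %/% 80

# Cross-tabulate bins against sentiment; absent combinations count as 0.
counts <- table(
  tagged$index,
  factor(tagged$sentiment, levels = c("negative", "positive"))
)

# Net sentiment per bin: positive count minus negative count.
net <- counts[, "positive"] - counts[, "negative"]
print(net)
```

Fixing the factor levels plays the role of fill = 0 in spread(): a bin with no positive words still gets a zero in the positive column.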
token_novel <- unnest_tokens(chapter, word, text) %>%
count(chapter, word, sort = TRUE) %>% ungroup()
chapter_novel <- bind_tf_idf(token_novel, word, chapter, n)
chapter_novel <- arrange(chapter_novel, desc(tf_idf))
chapter_novel[10:20,]
# A tibble: 11 x 6
   chapter word           n      tf   idf tf_idf
     <int> <chr>      <int>   <dbl> <dbl>  <dbl>
 1       0 wells          1 0.0196  3.33  0.0653
 2       0 anatomy        1 0.0196  2.64  0.0517
 3       0 melancholy     1 0.0196  2.64  0.0517
 4       0 book           1 0.0196  1.95  0.0382
 5       0 shall          1 0.0196  1.54  0.0302
 6      25 ulla          28 0.00901 3.33  0.0300
 7       0 war            1 0.0196  1.25  0.0246
 8      16 brother       50 0.0113  1.72  0.0195
 9       0 are            2 0.0392  0.442 0.0173
10      23 weed          11 0.00873 1.54  0.0134
11       6 mirror         4 0.00474 2.64  0.0125
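The numbers in this table can be reproduced by hand. bind_tf_idf() computes tf as a word's count divided by the document's total word count, and idf as the natural log of the number of documents divided by the number of documents containing the word. The sketch below checks the "wells" row. The helper name tf_idf_score is made up, and three inputs are inferred from the table rather than stated on the slide: chapter 0 holds 51 words (since 1/51 is about 0.0196), there are 28 chapters (0 through 27), and "wells" occurs in only one of them.

```r
# Hypothetical helper mirroring the tf-idf arithmetic.
tf_idf_score <- function(n, doc_total, n_docs, docs_with_term) {
  tf  <- n / doc_total                  # share of the document taken by the word
  idf <- log(n_docs / docs_with_term)   # natural log, rarer words score higher
  c(tf = tf, idf = idf, tf_idf = tf * idf)
}

# "wells": 1 occurrence in chapter 0 (assumed 51 words), 28 chapters, found in 1.
wells <- tf_idf_score(n = 1, doc_total = 51, n_docs = 28, docs_with_term = 1)
print(round(wells, 4))
```

The result agrees with the printed tf, idf, and tf_idf for "wells". It also shows why chapter 0 dominates the list: it is the short front matter, so a single occurrence there yields a large tf.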
cast_dfm() -> document-feature matrix (quanteda package)
cast_sparse() -> generic sparse matrix (Matrix package)
library(bigrquery)
library(tidyverse)
project <- "my-first-project-184914"
sql <- "#legacySQL
SELECT
stories.title AS title,
stories.text AS text
FROM
[bigquery-public-data:hacker_news.full] AS stories
WHERE
stories.deleted IS NULL
LIMIT
250000"
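hacker_news_raw <- query_exec(sql, project = project, max_pages = Inf)
# The code below uses hacker_news_text, which is not defined on these slides.
# It is presumably hacker_news_raw after text cleaning, with a postID column added.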
unigram_probs <- hacker_news_text %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
mutate(p = n / sum(n))
unigram_probs
# A tibble: 196,648 x 3
   word       n      p
   <chr>  <int>  <dbl>
 1 the   547784 0.0404
 2 to    380822 0.0281
 3 a     335134 0.0247
 4 of    269871 0.0199
 5 and   264712 0.0195
 6 i     232058 0.0171
 7 x27   219078 0.0162
 8 is    215909 0.0159
 9 that  213736 0.0158
10 it    201173 0.0148
# ... with 196,638 more rows
library(widyr)
tidy_skipgrams <- hacker_news_text %>%
unnest_tokens(ngram, text, token = "ngrams", n = 8) %>%
mutate(ngramID = row_number()) %>%
unite(skipgramID, postID, ngramID) %>%
unnest_tokens(word, ngram)
skipgram_probs <- tidy_skipgrams %>%
pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
mutate(p = n / sum(n))
normalized_prob <- skipgram_probs %>%
filter(n > 20) %>%
rename(word1 = item1, word2 = item2) %>%
left_join(unigram_probs %>%
select(word1 = word, p1 = p), by = "word1") %>%
left_join(unigram_probs %>%
select(word2 = word, p2 = p), by = "word2") %>%
mutate(p_together = p / p1 / p2)
pmi_matrix <- normalized_prob %>%
mutate(pmi = log10(p_together)) %>%
cast_sparse(word1, word2, pmi)
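The quantity stored in the matrix is pointwise mutual information: the log of how much more often two words share a window than they would if they were independent. A minimal sketch of the same arithmetic, with made-up probabilities (the function name pmi10 is illustrative):

```r
# PMI in base 10, following the pipeline: p / p1 / p2, then log10.
pmi10 <- function(p_joint, p1, p2) {
  log10(p_joint / p1 / p2)
}

# Hypothetical pair that co-occurs twice as often as independence predicts.
pmi_associated <- pmi10(p_joint = 0.0004, p1 = 0.01, p2 = 0.02)

# Hypothetical pair whose joint probability equals the product p1 * p2.
pmi_independent <- pmi10(p_joint = 0.0002, p1 = 0.01, p2 = 0.02)

print(c(associated = pmi_associated, independent = pmi_independent))
```

A positive value marks words that attract each other, a value near zero marks independence, and a negative value marks words that avoid each other.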
library(irlba)
pmi_svd <- irlba(pmi_matrix, 256, maxit = 1e3)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
V1 V2 V3 V4 V5
the -0.018315653 0.004973146 -0.038777857 -0.006393766 -0.02050269
to -0.041313708 0.018954042 -0.077170174 -0.021065735 -0.01003631
a -0.019219855 0.019526813 -0.020567201 -0.031538382 0.01400233
and 0.019921075 -0.039796364 -0.036157085 0.047092081 -0.01950567
of -0.001201721 -0.010269622 -0.052927814 0.025474243 -0.04899882
that -0.050997499 0.012205634 -0.074842144 -0.020170042 -0.01287471
x27 -0.109807935 0.035382684 -0.044326596 -0.069429378 0.01867271
is -0.036763421 0.008950814 -0.068116216 -0.036233972 0.01980432
i -0.114051408 0.047364811 0.005171922 -0.074395070 0.03745608
it -0.091467524 0.033111807 -0.045517237 -0.055687987 0.02135267
library(broom)
search_synonyms <- function(word_vectors, selected_vector) {
similarities <- word_vectors %*% selected_vector %>%
tidy() %>%
rename(token = .rownames,
similarity = unrowname.x.)
similarities %>%
arrange(-similarity)
}
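search_synonyms() scores each word by the dot product of its vector with the selected vector, then sorts in descending order. A broom-free sketch of the same ranking in base R follows. The function name search_synonyms_base and the two-dimensional toy vectors are made up for illustration; real vectors would come from pmi_svd$u.

```r
# Hypothetical 2-dimensional word vectors with words as row names.
toy_vectors <- rbind(
  facebook = c(0.9, 0.1),
  twitter  = c(0.8, 0.2),
  haskell  = c(0.1, 0.9),
  lisp     = c(0.2, 0.8)
)

# Rank every word by its dot product with the selected vector, highest first.
search_synonyms_base <- function(word_vectors, selected_vector) {
  scores <- drop(word_vectors %*% selected_vector)  # named numeric vector
  scores <- sort(scores, decreasing = TRUE)
  data.frame(token = names(scores), similarity = unname(scores),
             stringsAsFactors = FALSE)
}

result <- search_synonyms_base(toy_vectors, toy_vectors["haskell", ])
print(result)
```

Because the score is a raw dot product rather than a cosine, vector length matters as well as direction, so a word with a long vector can rank highly without being the closest in angle.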
facebook <- search_synonyms(word_vectors, word_vectors["facebook",])
head(facebook, 10)
token similarity
1 facebook 0.07303870
2 google 0.05226809
3 twitter 0.05034749
4 social 0.04780861
5 account 0.03941855
6 fb 0.03551794
7 etc 0.02816255
8 app 0.02742218
9 instagram 0.02501199
10 login 0.02378864
haskell <- search_synonyms(word_vectors, word_vectors["haskell",])
head(haskell, 10)
token similarity
1 haskell 0.05428521
2 lisp 0.04642205
3 languages 0.04387561
4 functional 0.04087069
5 clojure 0.03778325
6 scala 0.03710879
7 language 0.03533238
8 programming 0.02877087
9 erlang 0.02774153
10 java 0.02593245
mystery_product <- word_vectors["iphone",] - word_vectors["apple",] + word_vectors["microsoft",]
head(search_synonyms(word_vectors, mystery_product), 20)
token similarity
1 windows 0.02721096
2 7 0.02402575
3 iphone 0.02281845
4 phone 0.02255886
5 office 0.01802380
6 mobile 0.01756828
7 6 0.01734562
8 cell 0.01715644
9 android 0.01678512
10 8 0.01613067
11 anti 0.01603111
12 projects 0.01434070
13 use 0.01397506
14 phones 0.01358735
15 battery 0.01350073
16 microsoft 0.01330761
17 nexus 0.01313012
18 new 0.01286196
19 desktop 0.01254406
20 home 0.01240648
mystery_product <- word_vectors["iphone",] - word_vectors["apple",] + word_vectors["amazon",]
head(search_synonyms(word_vectors, mystery_product), 20)
token similarity
1 amazon 0.06712728
2 aws 0.05199616
3 s3 0.03739996
4 ec2 0.03384358
5 book 0.03358123
6 services 0.03220372
7 books 0.03123806
8 service 0.02875673
9 cloud 0.02774119
10 online 0.02604089
11 price 0.02549487
12 kindle 0.02532448
13 hosting 0.02355045
14 storage 0.02234644
15 card 0.02234487
16 docker 0.02214886
17 heroku 0.02081644
18 6 0.02049478
19 iphone 0.02035172
20 rds 0.02021610
Dynamic Word Embeddings for Evolving Semantic Discovery
Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, Hui Xiong
(Submitted on 2 Mar 2017 (v1), last revised 13 Feb 2018 (this version, v2))
Word evolution refers to the changing meanings and associations of words throughout time, as a byproduct of human language evolution. By studying word evolution, we can infer social trends and language constructs over different periods of human history. However, traditional techniques such as word representation learning do not adequately capture the evolving language structure and vocabulary. In this paper, we develop a dynamic statistical model to learn time-aware word vector representation. We propose a model that simultaneously learns time-aware embeddings and solves the resulting "alignment problem". This model is trained on a crawled NYTimes dataset. Additionally, we develop multiple intuitive evaluation strategies of temporal word embeddings. Our qualitative and quantitative tests indicate that our method not only reliably captures this evolution over time, but also consistently outperforms state-of-the-art temporal embedding approaches on both semantic accuracy and alignment quality.
Comments: 9 pages, published in the International Conference on Web Search and Data Mining (WSDM 2018)
Subjects: Computation and Language (cs.CL); Machine Learning (stat.ML)
DOI: 10.1145/3159652.3159703
Cite as: arXiv:1703.00607 [cs.CL]
hosvd() from the rTensor package?
Thank You